{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Semantic Parsing with PyText.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XhQf5bN2vrFV",
        "colab_type": "text"
      },
      "source": [
        "# Semantic parsing with PyText\n",
        "\n",
        "[PyText](https://engineering.fb.com/ai-research/pytext-open-source-nlp-framework/) is a modeling framework that blurs the boundaries between experimentation and large-scale deployment.  In Portal, PyText is used for production natural language processing tasks, including [semantic parsing](https://en.wikipedia.org/wiki/Semantic_parsing).  Semantic parsing involves converting natural language input to a logical form that can be easily processed by a machine.  Portal by Facebook uses PyText in production to semantically parse user queries.  In this notebook, we will use PyText to train a semantic parser on the freely available [Facebook Task Oriented Parsing dataset](https://fb.me/semanticparsingdialog) using the newly open-sourced Sequence-to-sequence framework.  We will export the resulting parser to a Torchscript file suitable for production deployment.\n",
        "\n",
        "## The PyText sequence-to-sequence framework\n",
        "\n",
        "We have recently open sourced our production sequence-to-sequence (Seq2Seq) framework in PyText ([framework](https://github.com/facebookresearch/pytext/commit/ff053d3388161917b189fabaa0e3058273ed4314), [Torchscript export](https://github.com/facebookresearch/pytext/commit/8dab0aec0e0456fdeb10ffac110f50e6a1382e6c)).  This framework provides an encoder-decoder architecture that is suitable for any task that requires mapping a sequence of input tokens to a sequence of output tokens.  Our existing implementation is based on recurrent neural networks (RNNs), which have been shown to be [unreasonably effective](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) at sequence processing tasks.  The model we will train includes three major components\n",
        "  1. A bidirectional LSTM sequence encoder\n",
        "  2. An LSTM sequence decoder\n",
        "  3. A sequence generator that supports incremental decoding and beam search\n",
        "\n",
        "All of these components are Torchscript-friendly, so that the trained model can be exported directly as-is.  \n",
        "\n",
        "# Instructions\n",
        "\n",
        "The remainder of this notebook installs PyText with its dependencies, downloads the training data to the local VM, trains the model, and verifies the exported Torchscript model.  It should run in 15-20 minutes if a GPU is used, and will train a reasonably accurate semantic parser on the Facebook TOP dataset.  As detailed below, simply increasing the number of epochs will allow a competitive result to be obtained in about an hour of training time.  The notebook will also export a Torchscript model which can be used for runtime inference from Python, C++ or Java.\n",
        "\n",
        "It is *strongly recommended* that this notebook be run on a GPU."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z7JHMBzqzVm0",
        "colab_type": "text"
      },
      "source": [
        "# Installing PyText\n",
        "\n",
        "As of this writing, semantic parsing requires a bleeding-edge version of PyText.  The following cell will download the master branch from Github.\n",
        "\n",
        "If PyText is installed, the packages may change.  The notebook will restart in this case.  You may rerun the cell after this and everything should be fine."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-3OJLChpviSa",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "try:\n",
        "  from pytext.models.seq_models.seq2seq_model import Seq2SeqModel\n",
        "  print(\"Detected compatible version of PyText.  Skipping install.\")\n",
        "  print(\"Run the code in the except block to force PyText installation.\")\n",
        "except ImportError:\n",
        "  !git clone https://github.com/facebookresearch/pytext.git\n",
        "  %cd pytext\n",
        "  !pip install -e .\n",
        "  print(\"Stopping RUNTIME because we installed new dependencies.\")\n",
        "  print(\"Rerun the notebook and everything should now work.\")\n",
        "  import os\n",
        "  os.kill(os.getpid(), 9)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AjnMr9VBzvi4",
        "colab_type": "text"
      },
      "source": [
        "# Downloading the data\n",
        "\n",
        "We'll use the Facebook Task Oriented Parsing (TOP) dataset for our example.  This dataset is publically available and can be added to the notebook with the following cells."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "S1bMMBc30CE2",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "!curl -o semanticparsingdialog.zip -L https://fb.me/semanticparsingdialog\n",
        "!unzip -o semanticparsingdialog.zip"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "c6l7IRaq5404",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "TOP_PATH = \"/content/top-dataset-semantic-parsing/\""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "huPCJO7rL0ks",
        "colab_type": "text"
      },
      "source": [
        "## Preprocessing the data\n",
        "\n",
        "In the interest of time, we're going to simplify the data a bit.  The notebook contains instructions on how to use the full dataset if you prefer.\n",
        "\n",
        "Following [Gupta et al.](https://arxiv.org/pdf/1810.07942.pdf), we'll use a single-valued output vocabulary so that our model focuses on predicting semantic structure."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5hmE5Va5N44C",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import random\n",
        "import re\n",
        "from contextlib import ExitStack\n",
        "\n",
        "def make_lotv(input_path, output_path, sample_rate=1.0):\n",
        "  with ExitStack() as ctx:\n",
        "    input_file = ctx.enter_context(open(input_path, \"r\"))\n",
        "    output_file = ctx.enter_context(open(output_path, \"w\"))\n",
        "    for line in input_file:\n",
        "      if random.random() > sample_rate:\n",
        "        continue\n",
        "      raw_seq, tokenized_seq, target_seq = line.split(\"\\t\")\n",
        "      output_file.write(\n",
        "          \"\\t\".join(\n",
        "              [\n",
        "                raw_seq,\n",
        "                tokenized_seq,\n",
        "                # Change everything but IN:*, SL:*, [ and ] to 0\n",
        "                re.sub(\n",
        "                    r\"(?!((IN|SL):[A-Z_]+(?<!\\S))|\\[|\\])(?<!\\S)((\\w)|[^\\w\\s])+\",\n",
        "                    \"0\",\n",
        "                    target_seq\n",
        "                )\n",
        "              ]\n",
        "          )\n",
        "      )\n",
        "\n",
        "# Running on the full test set takes around 40 minutes, so using a reduced set\n",
        "# here.  Change to (\"test.tsv\", 1.0) to evaluate on the full test set.\n",
        "for f, r in [(\"train.tsv\", 1.0), (\"eval.tsv\", 1.0), (\"test.tsv\", 0.1)]:\n",
        "  make_lotv(f\"{TOP_PATH}{f}\", f\"{TOP_PATH}lotv_{f}\", r)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "vAdHF89SSHVv",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "!head /content/top-dataset-semantic-parsing/lotv_train.tsv"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "C3AY48fqcI1X",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Comment out the next line to train on the original target instead of the\n",
        "# limited output vocabulary version.\n",
        "TOP_PATH = f\"{TOP_PATH}lotv_\""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tWZb3AhB4X_6",
        "colab_type": "text"
      },
      "source": [
        "# Preparing the PyText configuration\n",
        "\n",
        "## Data configuration\n",
        "\n",
        "PyText includes components to iterate through and preprocess the data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pdtbU_ZM0px3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from pytext.data import Data, PoolingBatcher\n",
        "from pytext.data.sources import TSVDataSource\n",
        "\n",
        "top_data_conf = Data.Config(\n",
        "    sort_key=\"src_seq_tokens\",\n",
        "    source=TSVDataSource.Config(\n",
        "        # Columns in the TSV.  These names will be used by the model.\n",
        "        field_names=[\"raw_sequence\", \"source_sequence\", \"target_sequence\"],\n",
        "        train_filename=f\"{TOP_PATH}train.tsv\",\n",
        "        eval_filename=f\"{TOP_PATH}eval.tsv\",\n",
        "        test_filename=f\"{TOP_PATH}test.tsv\",\n",
        "    ),\n",
        "    batcher=PoolingBatcher.Config(\n",
        "        num_shuffled_pools=10000,\n",
        "        pool_num_batches=1,\n",
        "        train_batch_size=64,\n",
        "        eval_batch_size=100,\n",
        "        # Testing relies on ScriptedSequenceGenerator, which\n",
        "        # does not support batch sizes > 1\n",
        "        test_batch_size=1,\n",
        "    ),\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-mWvOGLWMMP9",
        "colab_type": "text"
      },
      "source": [
        "## Model configuration\n",
        "\n",
        "We can use the object below to specify the architecture for the Seq2seq model"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-MNwVYKPLe_Q",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from pytext.data.tensorizers import TokenTensorizer\n",
        "from pytext.loss import LabelSmoothedCrossEntropyLoss\n",
        "from pytext.models.embeddings import WordEmbedding\n",
        "from pytext.models.seq_models.rnn_decoder import RNNDecoder\n",
        "from pytext.models.seq_models.rnn_encoder import LSTMSequenceEncoder\n",
        "from pytext.models.seq_models.rnn_encoder_decoder import RNNModel\n",
        "from pytext.models.seq_models.seq2seq_model import Seq2SeqModel\n",
        "from pytext.models.seq_models.seq2seq_output_layer import Seq2SeqOutputLayer\n",
        "from pytext.torchscript.seq2seq.scripted_seq2seq_generator import (\n",
        "    ScriptedSequenceGenerator\n",
        ")\n",
        "\n",
        "seq2seq_model_conf=Seq2SeqModel.Config(\n",
        "    # Source and target embedding configuration\n",
        "    source_embedding=WordEmbedding.Config(embed_dim=200),\n",
        "    target_embedding=WordEmbedding.Config(embed_dim=512),\n",
        "    \n",
        "    # Configuration for the tensorizers that transform the \n",
        "    # raw data to the model inputs\n",
        "    inputs=Seq2SeqModel.Config.ModelInput(\n",
        "        src_seq_tokens=TokenTensorizer.Config(\n",
        "            # Output from the data handling.  Must match one of the column\n",
        "            # names in TSVDataSource.Config, above\n",
        "            column=\"source_sequence\",\n",
        "            # Add begin/end of sequence markers to the model input\n",
        "            add_bos_token=True,\n",
        "            add_eos_token=True,\n",
        "        ),\n",
        "        trg_seq_tokens=TokenTensorizer.Config(\n",
        "            column=\"target_sequence\",\n",
        "            add_bos_token=True,\n",
        "            add_eos_token=True,\n",
        "        ),\n",
        "    ),\n",
        "    # Encoder-decoder configuration\n",
        "    encoder_decoder=RNNModel.Config(\n",
        "        # Bi-LSTM encoder\n",
        "        encoder=LSTMSequenceEncoder.Config(\n",
        "            hidden_dim=1024,\n",
        "            bidirectional=True,\n",
        "            dropout_in=0.0,\n",
        "            embed_dim=200,\n",
        "            num_layers=2,\n",
        "            dropout_out=0.2,\n",
        "        ),\n",
        "        # LSTM + Multi-headed attention decoder\n",
        "        decoder=RNNDecoder.Config(\n",
        "            # Needs to match hidden dimension of encoder\n",
        "            encoder_hidden_dim=1024,\n",
        "            dropout_in=0.2,\n",
        "            dropout_out=0.2,\n",
        "            embed_dim=512,\n",
        "            hidden_dim=256,\n",
        "            num_layers=1,\n",
        "            out_embed_dim=256,\n",
        "            attention_type=\"dot\",\n",
        "            attention_heads=1,\n",
        "        ),\n",
        "    ),\n",
        "    # Sequence generation via beam search.  Torchscript is used for \n",
        "    # runtime performance\n",
        "    sequence_generator=ScriptedSequenceGenerator.Config(\n",
        "        beam_size=5,\n",
        "        targetlen_b=3.76,\n",
        "        targetlen_c=72,\n",
        "        quantize=False,\n",
        "        nbest=1,\n",
        "    ),\n",
        "    output_layer=Seq2SeqOutputLayer.Config(\n",
        "        loss=LabelSmoothedCrossEntropyLoss.Config()\n",
        "    ),\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Tjzwk9ruXDmd",
        "colab_type": "text"
      },
      "source": [
        "## Task configuration\n",
        "\n",
        "Given the data and model configurations, it is straightforward to configure the PyText task.  "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8sg-yItfWno7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from pytext.optimizer.optimizers import Adam\n",
        "from pytext.optimizer.scheduler import ReduceLROnPlateau\n",
        "from pytext.task.tasks import SequenceLabelingTask\n",
        "from pytext.trainers import TaskTrainer\n",
        "\n",
        "seq2seq_on_top_task_conf = SequenceLabelingTask.Config(\n",
        "    data=top_data_conf,\n",
        "    model=seq2seq_model_conf,\n",
        "    # Training configuration\n",
        "    trainer=TaskTrainer.Config(\n",
        "        # Setting a small number of epochs so the notebook executes more \n",
        "        # quickly.  Set epochs=40 to get a converged model.  Expect to see \n",
        "        # about 4 more points of frame accuracy in a converged model.\n",
        "        epochs=10,\n",
        "        # Clip gradient norm to 5\n",
        "        max_clip_norm=5,\n",
        "        # Stop if eval loss does not decrease for 5 consecutive epochs\n",
        "        early_stop_after=5,\n",
        "        # Optimizer and learning rate\n",
        "        optimizer=Adam.Config(lr=0.001),\n",
        "        # Learning rate scheduler: reduce LR if no progress made over 3 epochs\n",
        "        scheduler=ReduceLROnPlateau.Config(patience=3),\n",
        "    ),\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TZi5B4YbaRbo",
        "colab_type": "text"
      },
      "source": [
        "## Complete the PyText config\n",
        "\n",
        "PyText bundles the task along with several environment settings in to a single config object that's used for training."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "M31JMHfiZUaB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from pytext.config import LATEST_VERSION as PytextConfigVersion, PyTextConfig\n",
        "\n",
        "pytext_conf = PyTextConfig(\n",
        "    task=seq2seq_on_top_task_conf,\n",
        "    # Export a Torchscript model for runtime prediction\n",
        "    export_torchscript_path=\"/content/pytext_seq2seq_top.pt1\",\n",
        "    # PyText configs are versioned so that configs saved by older versions can\n",
        "    # still be used by later versions.  Since we're constructing the config\n",
        "    # on the fly, we can just use the latest version.\n",
        "    version=PytextConfigVersion,\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "__YZR042bwsZ",
        "colab_type": "text"
      },
      "source": [
        "# Train the model\n",
        "\n",
        "PyText uses the configuration object for training and testing."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "W6VFB_E1bJn_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from pytext.workflow import train_model\n",
        "\n",
        "model, best_metric = train_model(pytext_conf)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oE8G-sFoV3Be",
        "colab_type": "text"
      },
      "source": [
        "# Test the model\n",
        "\n",
        "PyText training saves a snapshot with the model and training state.  We can use the snapshot for testing as well."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LMvEWLsijqQ7",
        "colab_type": "text"
      },
      "source": [
        "TODO: this is too slow.  We should see if we can get away with parallelism, and if not significantly downsample the test set."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "G0hUzhuPXEDm",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from pytext.workflow import test_model_from_snapshot_path\n",
        "\n",
        "test_model_from_snapshot_path(\n",
        "    \"/tmp/model.pt\",\n",
        "    # Sequence generation is presently CPU-only\n",
        "    False,\n",
        "    None,\n",
        "    None,\n",
        "    \"/content/pytext_seq2seq_top_results.txt\"\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "49usGDfbUZyX",
        "colab_type": "text"
      },
      "source": [
        "# Using the model at runtime\n",
        "\n",
        "The exported Torchscript model can be used for efficient runtime inference.  We will demonstrate the API in Python here, but the file can also be loaded in [Java](https://pytorch.org/javadoc/) and [C++](https://pytorch.org/cppdocs/)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YK0Wb6lgdF8h",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import torch\n",
        "scripted_model = torch.jit.load(\"/content/pytext_seq2seq_top.pt1\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zmS8jHwAT3g6",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# To use the exported model, we need to manually add begin/end of sequence\n",
        "# markers.\n",
        "BOS = \"__BEGIN_OF_SENTENCE__\"\n",
        "EOS = \"__END_OF_SENTENCE__\"\n",
        "scripted_model(f\"{BOS} what is the shortest way home {EOS}\".split())"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9t0irQRJUD82",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}
